Abstract

This paper summarizes an ongoing data-fusion project whose
purpose is to identify patterns in object-oriented databases
derived from message traffic generated during U.S. Marine
Corps exercises. These patterns will be used to predict
attacks during wartime and other periods of tension or
conflict. The paper describes a concept of operations, a
literature survey of relevant data-mining and classifier
algorithms, plans for system development, and directions
for future research.

Keywords:  Bayesian  networks,  classifiers,  data  fusion, 
data  mining  algorithms,  military  exercise,  object-oriented 

1.  Introduction

The ability to predict attacks and other hostile events
during times of conflict is highly desirable to military
commanders from the standpoint of readiness. The more
advance notice and the more widespread the notification, the
better able all echelons are to respond to threats efficiently
and with the correct combination of forces.

The literature is replete with recent research results on
data mining and data classification (see, for example,
[2, 3, 4, 7]). Data mining, data classification, and data
correlation are among the many techniques used to achieve
data fusion. As these techniques mature, better tools
become available to model and correlate data from complex
operational scenarios.

2.  Concept  of  operations

The concept of operations is to use data-mining and
data-classification algorithms to detect patterns associated
with attacks (e.g., to identify factors that indicate an
imminent attack) and to correlate them with current events,
with a view toward supplying military commanders with a
prediction of the next attack and a confidence level for
that prediction. A considerable amount of data associated
with events that preceded known attacks is required to model
attacks, to search for common features, and to find these
patterns in new data.

S. J. McCarthy, Ph.D.

Space and Naval Warfare Systems Center, D432
53560 Hull Street
San Diego, CA 92152-5001, USA
+1 619 553 1520
mccarthy@spawar.navy.mil

Success in this effort depends on characterizing the
circumstances that preceded past attacks in terms of
well-defined observables. The more detailed the available
knowledge, the greater the probability that data
instantiating critical variables will be collected. It is
expected that such detailed data will not be available for
all variables prior to future attacks and that not all
available data will be useful in predicting attacks (noise).
Thus, the task involves identification of algorithms that
can operate on incomplete data, detection of pre-attack
features in clutter, and pattern recognition. This project
can be successful because modern methods of statistical
pattern recognition are sufficiently computationally
oriented to use a higher-dimensional space and because they
are less sensitive to noise.

3.  Approach 

The approach will include an examination of Marine
battlefield intelligence requirements, from which a list of
several specific requirements for data fusion with an audit
trail will be generated. Hostile events will be characterized
with respect to as many relevant variables as are deemed
necessary to predict future attacks. Message-traffic
databases will be analyzed for the occurrence of telltale
signs of pending attacks. A goal is to generate an event
prediction (in terms of a probability) with an associated
confidence value. Therefore, it is necessary to know which
combinations of events and observations have a higher
probability of indicating a future attack. A baseline can be
modeled from normal operational scenarios.

The attack alarm-generation process and the reduction of
false positives can be accomplished using constraints from
models of known attacks. One approach to explore is the
generation of a knowledge base built on Bayesian networks.


The generation of appropriate features that can serve to
flag imminent attacks is the least scientific and most
challenging part of the process. A literature search will be
conducted to determine whether the U.S. Navy or the U.S.
Army has made any progress in this area that the Marine
Corps can use.

A literature search for data-mining algorithms is also in
progress, with emphasis on algorithms designed to operate
on sparse data or data exceptions. These data-mining
algorithms will be used to identify complex patterns in the
data that correlate well with hostile events. Criteria for
sufficient correlation and confidence levels in data
associations will be developed. Correlation strength, one
metric that could be used, is the ratio of the joint
probability to the individual probability of observing a
pattern [2].
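
As an illustration only, the correlation-strength metric can be
sketched in a few lines of Python. The sketch assumes one common
reading of the definition, namely P(A and B) / (P(A) * P(B)), so
that a value near 1 suggests independence and a value above 1
suggests positive correlation. The record format and the indicator
names are hypothetical and are not taken from the project data.

```python
def correlation_strength(records, a, b):
    """Estimate P(a and b) / (P(a) * P(b)) from a list of indicator sets.

    Assumption: this ratio is one reading of "correlation strength";
    the source describes the metric only in words.
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records")
    count_a = sum(1 for r in records if a in r)
    count_b = sum(1 for r in records if b in r)
    count_ab = sum(1 for r in records if a in r and b in r)
    if count_a == 0 or count_b == 0:
        return float("nan")  # undefined when an item never occurs
    return (count_ab / n) / ((count_a / n) * (count_b / n))


# Hypothetical records: each set lists indicators seen in one time window.
records = [
    {"recon_flight", "radio_silence", "attack"},
    {"recon_flight", "attack"},
    {"radio_silence"},
    {"resupply"},
    {"recon_flight", "radio_silence", "attack"},
    {"resupply", "radio_silence"},
]

strength_recon = correlation_strength(records, "recon_flight", "attack")
strength_radio = correlation_strength(records, "radio_silence", "attack")
```

In this toy data the first indicator always accompanies an attack,
so its ratio exceeds 1, while the second occurs about as often with
an attack as without one, so its ratio is close to 1.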

The Space and Naval Warfare Systems Center has access to
SRI’s classifier algorithms. For example, Tree-Augmented
Naive Bayes (TAN) is a classifier algorithm based on
Bayesian networks, developed at SRI, with the advantages of
robustness and polynomial computational complexity [3, 4].
Bayesian networks are a suitable technology for the
following reasons:

•  First, one need not provide all joint probability values
to specify a probability distribution for collections of
independent variables [1].

•  Second, one could mix modeling approaches, e.g., explicit
knowledge engineering for knowledge elicited from experts,
with statistical induction from data and adaptivity. This
will require fewer data values to induce better-quality
models.

•  Third, one could use these models to compute the value of
information. For example, having seen signs "A" and "B" of
an imminent attack, what is the best information to collect
next to confirm that hypothesis?

•  Fourth,  one  could  characterize  explicitly  the  kinds  of 
attacks.  For  example,  given  an  attack  of  type  “D,”  what  are 
the  most  likely  signals?  These  signals  could  be  collected 
regularly  to  fill  the  database  used  as  input  into  TAN. 
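
The value-of-information question in the third reason can be made
concrete with a small sketch. It uses expected entropy reduction
(mutual information between the attack hypothesis and a candidate
observation) as a stand-in for value of information; that choice is
an assumption, since a full treatment would also weigh costs and
utilities. All probabilities and observation names are invented for
illustration.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a {value: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior, likelihood):
    """Expected drop in entropy over the hypothesis after one observation.

    prior:      {c: P(c)}, e.g. the belief after seeing signs "A" and "B"
    likelihood: {c: {o: P(o | c)}} for one candidate observation
    """
    outcomes = {o for c in likelihood for o in likelihood[c]}
    gain = entropy(prior)
    for o in outcomes:
        p_o = sum(prior[c] * likelihood[c].get(o, 0.0) for c in prior)
        if p_o == 0:
            continue
        post = {c: prior[c] * likelihood[c].get(o, 0.0) / p_o for c in prior}
        gain -= p_o * entropy(post)
    return gain


# Hypothetical belief after seeing signs "A" and "B".
prior = {"attack": 0.6, "no_attack": 0.4}

# Hypothetical candidate observations and their likelihoods.
candidates = {
    "troop_movement": {
        "attack": {"yes": 0.9, "no": 0.1},
        "no_attack": {"yes": 0.2, "no": 0.8},
    },
    "weather_report": {
        "attack": {"yes": 0.5, "no": 0.5},
        "no_attack": {"yes": 0.5, "no": 0.5},
    },
}

gains = {name: expected_information_gain(prior, lik)
         for name, lik in candidates.items()}
best_next = max(gains, key=gains.get)
```

An observation whose likelihood is the same under both hypotheses
yields no expected gain, so the sketch prefers the observation that
discriminates between them.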

TAN makes some tradeoffs between accuracy and computation.
It approximates a probability distribution under constraints
on the complexity of the representation; however, it is
extremely fast (low-order polynomial), efficient (one pass
over the data), and robust (low-order statistics).
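
For orientation, the constraint can be written out using the
standard published formulation of TAN, which is assumed here to
match the SRI implementation. Each attribute has the class variable
as a parent and at most one other attribute as a parent, so the
joint distribution factors as shown below. The symbol \pi(i), the
index of the attribute parent of X_i, is notation introduced for
this illustration.

```latex
\[
  P(C, X_1, \ldots, X_n)
    = P(C) \prod_{i=1}^{n} P\bigl(X_i \mid C, X_{\pi(i)}\bigr),
\]
% \pi(i) is the index of the single attribute parent of X_i;
% for the root attribute the factor reduces to P(X_i \mid C).
```

Because every factor involves at most three variables, only
low-order counts need to be gathered from the data, which is
consistent with the robustness and one-pass properties noted above.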

TAN accepts data sets as input and induces Bayesian
networks as output. Specifically, TAN is intended to be
used as a classification algorithm, which means that the
input would be a file with tuples of the form
$\{x_1, x_2, x_3, \ldots, x_n, c\}$, where each $x_i$ is the
value that variable $X_i$ takes and $c$ is the value that the
class variable $C$ takes. To set the range of each variable,
TAN needs an auxiliary file that includes a description of
each variable, including the range of values representing
the degree of intensity.

TAN’s output is a Bayesian network that encodes
$P(C, X_n, \ldots, X_1)$ in an efficient manner. To use TAN
as a classifier, one simply computes
$P(C \mid x'_n, \ldots, x'_1)$. Given a new vector
$(x'_n, \ldots, x'_1)$ and the resulting probability
distribution over $c$, one can select the event with the
highest probability as the classification. To compute
confidence in this classification, the bootstrap method can
be used [5].
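
The classify-then-bootstrap step can be sketched as follows. This
is a minimal sketch under stated assumptions, not the SRI software:
a plain naive Bayes model with Laplace smoothing stands in for TAN,
and confidence is taken to be the fraction of bootstrap replicates
that reproduce the full-data prediction, which is only one simple
way to apply the bootstrap. The variable values and class labels
are invented.

```python
import random
from collections import Counter, defaultdict


def train(rows, alpha=1.0):
    """Fit naive Bayes (a stand-in for TAN) on tuples (x_1, ..., x_n, c)."""
    class_counts = Counter(row[-1] for row in rows)
    n_features = len(rows[0]) - 1
    # feature_counts[i][c][v] = rows with class c whose i-th value is v
    feature_counts = [defaultdict(Counter) for _ in range(n_features)]
    values = [set() for _ in range(n_features)]
    for row in rows:
        c = row[-1]
        for i, v in enumerate(row[:-1]):
            feature_counts[i][c][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values, alpha


def posterior(model, x):
    """Return {c: P(c | x)} from Laplace-smoothed estimates."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total
        for i, v in enumerate(x):
            k = len(values[i] | {v})  # distinct values, counting v itself
            p *= (feature_counts[i][c][v] + alpha) / (n_c + alpha * k)
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


def classify(model, x):
    """Pick the class with the highest posterior probability."""
    post = posterior(model, x)
    return max(post, key=post.get), post


def bootstrap_confidence(rows, x, n_boot=200, seed=0):
    """Share of resampled models that agree with the full-data prediction."""
    rng = random.Random(seed)
    predicted, _ = classify(train(rows), x)
    agree = 0
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        label, _ = classify(train(sample), x)
        agree += label == predicted
    return predicted, agree / n_boot


# Hypothetical training tuples: (recon activity, radio traffic, class).
rows = [
    ("high", "quiet", "attack"), ("high", "quiet", "attack"),
    ("high", "normal", "attack"), ("low", "quiet", "attack"),
    ("low", "normal", "none"), ("low", "normal", "none"),
    ("low", "quiet", "none"), ("high", "normal", "none"),
]
new_vector = ("high", "quiet")

label, post = classify(train(rows), new_vector)
predicted, confidence = bootstrap_confidence(rows, new_vector)
```

A network induced by TAN could be substituted for the stand-in by
replacing `train` and `posterior`; the selection and resampling
logic would stay the same.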


In addition to TAN, SRI has more general algorithms for
inducing Bayesian networks that do not make the compromises
that TAN does. These algorithms try to fit the best
distribution possible with no constraints. The disadvantage
is that the computation of these models is slower; however,
this may be acceptable and desirable in some cases. Both
kinds of algorithms can be run on the same data and the
results compared.

4.  Software  implementation 

A  user-friendly  interface  will  be  designed  on  top  of  the 
algorithms  to  facilitate  the  selection  of  the  best  algorithm 
to  use  in  a  given  situation,  and  to  provide  automated  input 
of  selected  data  sets  to  the  algorithm  of  choice.  TAN  will 
be  used  as  a  base  classifier  and  also  as  a  method  to  fuse  the 
output  of  other  data-mining  and  classification  algorithms. 

Most of the data sets will come from the Integrated Marine
Multi-agent Command and Control System (IMMACCS) Database
[6].